Semester Project Report - Ehrenreich Collection Segmentation

Author

Hugo Demule

Published

January 5, 2026

Introduction

The Ehrenreich Collection is a unique archive of private opera recordings amassed by the New York opera enthusiast Leroy Ehrenreich, documenting live performances from major opera venues between 1965 and 2010. It contains thousands of hours of bootleg recordings, including live captures, radio tapes, and some commercial sources, reflecting over four decades of vocal performance, repertoire, and interpretive practice. (Bern University of Applied Sciences 2024) (Hochschule der Künste Bern 2021)

Since 2018, the Hochschule der Künste Bern (Bern Academy of the Arts) has been digitizing, cataloguing, and researching the collection within the project Ehrenreich Collection — Identity, Voice, Sound. This work not only preserves these rare acoustic documents but also supports long-term study of cultural, interpretative, and reception phenomena in live opera, exploring such aspects as performance variation, audience sound, and the broader context of opera bootlegging culture.

The collection serves as an important resource for musicological and computational research, opening perspectives on opera interpretation and facilitating efforts, like those in this project, to analyze and segment recordings using automated audio processing techniques.

Even with such a rich collection, studying operas can be tedious. One of the first things musicologists might want to do is segment an opera recording into its corresponding movements. Such a segmentation has not yet been done for the Ehrenreich collection, and producing it is what this project is about.

Tackling this problem requires considering different research directions and evaluating the trade-offs of each. Two complementary approaches were explored:

  1. Audio-only segmentation, relying solely on acoustic features extracted from the opera recordings.
    1. Basic energy-based methods (silence and applause detection)
    2. Novelty-curve based methods using various audio features (chromagram, MFCCs, tempogram)
  2. Segment alignment, where pre-existing opera segments are aligned to full-length recordings.

In addition to algorithmic development, a software application using PyQt6 (Riverbank Computing Limited 2021) was implemented to allow interactive exploration and comparison of all proposed methods. Video demonstrations of the application usage can be found under the ▶ Application Usage sections throughout the report.

Note: All the code in this report (imported from src) can be found at (Demule 2025).

from src.audio.audio_file import AudioFile
from src.audio.signal import Signal
from src.io.ts_annotation import TSAnnotations
signal: Signal = AudioFile("data/report_sample.wav").load()
content_parts = TSAnnotations.load_annotations("data/transitions.csv")
* Loaded Audio Path 'data/report_sample.wav' 
* Samples Number: 4083923 
* Sample Rate: 44100 Hz 
* Duration: 00:01:32 
* File Size: 7.79 MB

Note 2: This report uses a reference audio file named report_sample.wav for demonstration purposes. This file is a short example made from two songs and one applause segment. The first song is Mumbo Sugar by Arc de Soleil and the second one is the main theme of Princess Mononoke by Joe Hisaishi. The segment is constructed as follows:

  • 0s - 20.5s: Mumbo Sugar by Arc de Soleil
  • 20.5s - 23.5s: Silence
  • 23.5s - 37s: Mumbo Sugar sped up by a factor 2
  • 37s - 45s: Applause sound effect
  • 45s - 1:06s: Mumbo Sugar at normal speed
  • 1:06s - 1:32s: Princess Mononoke by Joe Hisaishi

This construction provides different types of transitions that mimic those found in operas (change of tempo, silence, applause, change of timbre, harmonic structure, etc.). We define the ground-truth transitions at the following timestamps: 22s, 41s, and 1:06.

The different “content” parts of the audio are colored on the plots to help visualize the segmentation results (transitions remain white).

Theoretical Background

This work is largely grounded in the framework presented by Müller (Müller 2015), which provides a unified view of feature extraction, novelty-based segmentation, and structural analysis.

1.a – Basic Segmentation Methods

This section presents the segmentation algorithms relying exclusively on audio features extracted from the opera recordings.

Silence Curve

Principle

Silence-based segmentation relies on the detection of low-energy regions in the audio signal, which often coincide with structural boundaries.
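
The idea can be sketched in a few lines of NumPy (an illustrative helper, not the project's SilenceCurveBuilder used below):

import numpy as np

def silence_curve(x: np.ndarray, frame_length: int = 44100, hop_length: int = 22050) -> np.ndarray:
    """Frame-wise RMS energy, inverted so that silent regions peak near 1."""
    n_frames = 1 + max(0, (len(x) - frame_length) // hop_length)
    rms = np.array([
        np.sqrt(np.mean(x[i * hop_length : i * hop_length + frame_length] ** 2))
        for i in range(n_frames)
    ])
    rms /= rms.max() + 1e-12  # normalize energy to [0, 1]
    return 1.0 - rms          # high values indicate silence candidates

Peaks of this curve above a threshold are then proposed as boundary candidates, as in the implementation below.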

Implementation

Energy is computed over short-time frames and smoothed to obtain a silence curve.

from src.audio_features.builders import SilenceCurveBuilder
from src.audio_features.features import SilenceCurve

builder = SilenceCurveBuilder(silence_type="spectral", frame_length=44100, hop_length=22050)
silence_curve = builder.build(signal)
peaks = silence_curve.find_peaks(threshold=0.8, distance_seconds=5)
silence_curve.plot(original_signal=signal, time_annotations=content_parts, peaks=peaks, figsize=(8,8))

Silence curve showing low-energy regions in the audio signal

As can be observed on the plot, the dark areas of the silence curve mark the spots in the audio with the lowest sound amplitude.

Applause Curve (HRPS)

Principle

The harmonic/percussive separation technique originally developed by Derry Fitzgerald (Fitzgerald 2010) is a powerful method for splitting an audio signal into two new ones, one containing the harmonic components and the other the percussive ones. The algorithm was later extended with a residual part sitting between the two components, capturing elements of a sound that can be classified as neither harmonic nor percussive (Driedger, Müller, and Disch 2014). This extended technique is called Harmonic-Residual-Percussive Separation (HRPS) and turns out to be very effective at finding applause in audio. The intuition behind why HRPS works in this situation is that applause is inherently percussive but also lasts a long time, placing it exactly between the two categories.
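
The separation idea can be sketched with median filtering on the power spectrogram (a simplified, binary-mask illustration; the project's HRPSBuilder below implements the full pipeline):

import numpy as np
import librosa
from scipy.ndimage import median_filter

def hrps_masks(y: np.ndarray, l_h: int = 30, l_p: int = 100, beta: float = 1.75):
    """Median filtering along time (harmonic) and frequency (percussive),
    with a beta-controlled residual in between (Driedger, Müller, and Disch 2014)."""
    X = librosa.stft(y)
    Y = np.abs(X) ** 2
    Y_h = median_filter(Y, size=(1, l_h))  # smooth along time -> harmonic estimate
    Y_p = median_filter(Y, size=(l_p, 1))  # smooth along frequency -> percussive estimate
    M_h = Y_h >= beta * Y_p                # clearly harmonic bins
    M_p = Y_p > beta * Y_h                 # clearly percussive bins
    M_r = ~(M_h | M_p)                     # residual: neither, where applause tends to live
    return M_h, M_r, M_p

Multiplying the STFT by each mask and inverting it yields the three component signals whose local energies form the applause curve.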

Implementation

from src.audio_features.builders import HRPSBuilder
from src.audio_features.features import HRPS

builder = HRPSBuilder(L_h_frames=30, L_p_bins=100, beta=1.75, frame_length=4410, hop_length=2205, downsampling_factor=5)
hrps = builder.build(signal)
hrps.plot(original_signal=signal, time_annotations=content_parts, figsize=(8,8))
Computing STFT... Computed STFT
Applying Median for Harmonic Component (1/2): median filtering took 0.15 seconds
Applying Median for Percussive Component (2/2): median filtering took 0.59 seconds
Computing Masks... Computed Masks
Computing Inverse STFT for x_h, x_r, x_p
Computing Local Energy for Harmonic, Residual, and Percussive Signals

Local energy of the harmonic, residual, and percussive components obtained by HRPS; the residual energy rises during the applause segment.

▶ Application Usage

The following demonstration video shows how to use the application to explore the basic segmentation methods presented in this section. After a module outputs a curve with its detected transitions, you can either add all of that module's transitions by right-clicking on the plot and pressing "Add all transitions", or add a specific one by right-clicking on it and pressing "Add this transition". All added transitions are then displayed on the main timeline at the bottom of the application window, which is the final segmentation output that can be exported.

1.b – Novelty Curve Segmentation Methods

This section presents segmentation methods based on novelty curves derived from various audio features. In this project, three features are explored for novelty curve computation: the chromagram, mel-frequency cepstral coefficients (MFCCs), and the tempogram. Each feature captures different aspects of the audio signal, providing complementary information for segmentation. Spectrogram-based novelty curves were not used because of their high dimensionality and computational cost, which can be very limiting for a reactive application that needs to compute segmentations on the fly. Moreover, a spectrogram contains a lot of redundant information, and possibly more noise, which would make the segmentation less robust.

The chromagram plays the role of capturing harmonic and tonal changes (similar to what a spectrogram would do but in a more compact way), the MFCCs capture timbral variations, and the tempogram focuses on rhythmic changes. By combining novelty curves from these diverse features, a more robust segmentation can be achieved, leveraging the strengths of each representation.
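
All three variants share the same machinery from Müller (2015): compute a feature sequence, build a self-similarity matrix (SSM) from it, and correlate a Gaussian checkerboard kernel along the SSM's main diagonal to obtain the novelty curve. A minimal NumPy sketch of that last step (assuming a precomputed square SSM S; the kernel parameterization is illustrative and not the exact one used by compute_novelty_curve):

import numpy as np

def checkerboard_kernel(size: int, variance: float) -> np.ndarray:
    """Gaussian-tapered checkerboard kernel of shape (2*size+1, 2*size+1)."""
    t = np.arange(-size, size + 1) / size              # normalized lag in [-1, 1]
    taper = np.exp(-variance * t ** 2)                 # Gaussian taper
    return np.outer(np.sign(t), np.sign(t)) * np.outer(taper, taper)

def novelty_from_ssm(S: np.ndarray, size: int = 20, variance: float = 2.0) -> np.ndarray:
    """Slide the kernel along the main diagonal of the SSM."""
    K = checkerboard_kernel(size, variance)
    S_pad = np.pad(S, size)                            # zero-pad the borders
    nov = np.array([
        np.sum(S_pad[n : n + 2 * size + 1, n : n + 2 * size + 1] * K)
        for n in range(S.shape[0])
    ])
    nov -= nov.min()
    return nov / (nov.max() + 1e-12)                   # min-max normalize to [0, 1]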

Novelty Curve: Chromagram

Principle

A chromagram represents how strong each of the twelve musical notes (C, C#, D, …, B) appears over time, regardless of which octave they’re played in. This makes it perfect for tracking harmonic changes and key transitions in music, since it focuses on the musical “color” rather than the exact pitch height.

The chromagram is computed by grouping frequency components into pitch classes:

C(n, p) = \sum_{f \in F_p} |X(n, f)|^2

where C(n, p) is the chroma value for time frame n and pitch class p, X(n, f) is the frequency content from the STFT, and F_p contains all frequency bins that correspond to pitch class p across different octaves.
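
This pooling can be written directly (a simplified sketch using librosa's STFT helpers, not the project's ChromagramBuilder):

import numpy as np
import librosa

def simple_chromagram(y: np.ndarray, sr: int, n_fft: int = 4410, hop: int = 2205) -> np.ndarray:
    """Direct implementation of C(n, p): pool STFT power into 12 pitch classes."""
    power = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop)) ** 2  # |X(n, f)|^2
    freqs = librosa.fft_frequencies(sr=sr, n_fft=n_fft)
    C = np.zeros((12, power.shape[1]))
    for k, f in enumerate(freqs):
        if f > 0:
            pitch_class = int(round(69 + 12 * np.log2(f / 440.0))) % 12  # bin f -> pitch class F_p
            C[pitch_class] += power[k]
    return C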

Implementation

from src.audio_features.builders import ChromagramBuilder, SSMBuilder
from src.audio_features.features import Chromagram
import numpy as np

builder = ChromagramBuilder(frame_length=4410, hop_length=2205)
chromagram = builder.build(signal=signal)
chromagram = chromagram.normalize(norm='2', threshold=0.001)
chromagram = chromagram.smooth(filter_length=11, window_type='boxcar')
chromagram = chromagram.downsample(factor=5)
chromagram = chromagram.ensure_positive() # Important for log compression
chromagram = chromagram.log_compress(gamma=20)

ssm = SSMBuilder(
    smoothing_filter_length=1,
    smoothing_filter_direction='both',
    shift_set=np.array([0]),
    tempo_relative_set=np.array([1]),
).build(chromagram)

ssm = ssm.threshold(thresh=0.8)
ssm.plot(x_axis_type='time', original_base_feature=chromagram, time_annotations=content_parts)

(top) Chromagram computed from the raw signal showing the 12 pitch classes, (bottom) self-similarity matrix derived from this chromagram.
from src.audio_features.features import NoveltyCurve

nc_chromagram: NoveltyCurve = ssm.compute_novelty_curve(kernel_size=20, variance=2, exclude_borders=True)

peaks = nc_chromagram.find_peaks(threshold=0.4, distance_seconds=30)
nc_chromagram.plot(x_axis_type='time', novelty_name='Chromagram Novelty Curve', peaks=peaks, time_annotations=content_parts, figsize=(8, 4))

print("time_annotations:", content_parts)
print("Type of time_annotations:", type(content_parts))

nc_chromagram_smooth = nc_chromagram.smooth(sigma=10)
nc_chromagram_smooth.plot(x_axis_type='time', novelty_name='Smoothed Chromagram Novelty Curve', peaks=peaks, time_annotations=content_parts, figsize=(8, 4))

Novelty curve computed from the chromagram self-similarity matrix. Peaks indicate high novelty points in terms of harmonic features, possibly corresponding to structural boundaries. The curve has been normalized using min-max normalization, which gives it a loosely probabilistic interpretation for transitions.

As can be observed, the highest peak is located at the 3rd transition (~1:06), which corresponds to the change between the two songs and thus to a significant harmonic change.

Novelty Curve: MFCC

Principle

Mel-Frequency Cepstral Coefficients (MFCCs) capture the “color” or texture of sound—what makes a piano sound different from a violin playing the same note. They work by mimicking how our ears naturally process sound.

The human ear doesn’t hear all frequencies equally. We’re more sensitive to differences in low frequencies than high ones. MFCCs replicate this by using the Mel scale, which spaces frequencies the same way our ear perceives them. Additionally, our hearing works logarithmically—the difference between 100Hz and 200Hz sounds similar to the difference between 1000Hz and 2000Hz. MFCCs apply this logarithmic compression to match our natural hearing.

The computation follows our ear’s processing steps:

STFT → Mel filterbank → log compression → DCT

In opera recordings, MFCCs capture the overall sonic texture created by the blend of voices and orchestra. When the instrumentation changes, when a new singer enters, or when the recording conditions shift, MFCCs detect these timbral transitions effectively, making them excellent for finding section boundaries based on “how the music sounds” rather than “what notes are played.”

\mathrm{MFCC}(n,k) = \sum_{m=1}^{M} \log\!\left( \sum_{f} |X(n,f)|^2 \, H_m(f) \right) \cos\!\left( \frac{\pi k}{M}\left(m - \frac{1}{2}\right) \right)

where:

\mathrm{MFCC}(n, k) is the k-th MFCC coefficient at frame n,

X(n,f) is the Short-Time Fourier Transform (STFT) of the signal,

H_m(f) is the m-th Mel filterbank,

M is the number of Mel filters,

k is the cepstral coefficient index (typically k=0,\dots,K-1).
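
Given these definitions, the whole chain can be reproduced step by step (an illustrative from-scratch sketch, not the project's MFCCBuilder):

import numpy as np
import librosa
from scipy.fftpack import dct

def mfcc_from_scratch(y: np.ndarray, sr: int, n_mfcc: int = 12, n_mels: int = 40) -> np.ndarray:
    """STFT -> Mel filterbank -> log compression -> DCT, mirroring the formula above."""
    power = np.abs(librosa.stft(y, n_fft=4410, hop_length=2205)) ** 2  # |X(n, f)|^2
    mel_fb = librosa.filters.mel(sr=sr, n_fft=4410, n_mels=n_mels)     # H_m(f)
    log_mel = np.log(mel_fb @ power + 1e-10)                           # log(sum_f |X|^2 H_m)
    return dct(log_mel, axis=0, norm="ortho")[:n_mfcc]                 # DCT over the Mel bands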

Implementation

from src.audio_features.builders import MFCCBuilder, SSMBuilder
from src.audio_features.features import MFCC
import numpy as np

builder = MFCCBuilder(n_mfcc=12, frame_length=4410, hop_length=2205)
mfcc = builder.build(signal=signal)
mfcc = mfcc.normalize(norm='2', threshold=0.001)
mfcc = mfcc.smooth(filter_length=11, window_type='boxcar')
mfcc = mfcc.downsample(factor=5)
mfcc = mfcc.ensure_positive() # Important for log compression
# mfcc = mfcc.log_compress(gamma=20)

ssm = SSMBuilder(
    smoothing_filter_length=1,
    smoothing_filter_direction='both',
    shift_set=np.array([0]),
    tempo_relative_set=np.array([1]),
).build(mfcc)

ssm = ssm.threshold(thresh=0.8)
ssm.plot(x_axis_type='time', original_base_feature=mfcc, time_annotations=content_parts)
Applied offset of 1.0000000100000002 to ensure positive values

(top) MFCCs computed from the raw signal showing the 12 coefficients, (bottom) self-similarity matrix derived from these MFCCs.
from src.audio_features.features import NoveltyCurve

nc_mfcc: NoveltyCurve = ssm.compute_novelty_curve(kernel_size=20, variance=2, exclude_borders=True)

peaks = nc_mfcc.find_peaks(threshold=0.2, distance_seconds=1)
nc_mfcc.plot(x_axis_type='time', novelty_name='MFCC Novelty Curve', peaks=peaks, time_annotations=content_parts, figsize=(8, 4))

nc_mfcc_smooth = nc_mfcc.smooth(sigma=10)
nc_mfcc_smooth.plot(x_axis_type='time', novelty_name='Smoothed MFCC Novelty Curve', peaks=peaks, time_annotations=content_parts, figsize=(8, 4))

Novelty curve computed from the MFCC self-similarity matrix. Peaks indicate high novelty points in terms of timbral features, possibly corresponding to structural boundaries. The curve has been normalized using min-max normalization, which gives it a loosely probabilistic interpretation for transitions.

On this plot, the highest peak is located at the 2nd transition (~41s) which corresponds to the applause segment that introduces a significant timbral change in the audio. The other two transitions are also detected but with less intensity.

Novelty Curve: Tempogram

Principle

A tempogram reveals the rhythmic patterns and tempo changes in music over time. It works by finding repeating patterns in when musical events (like notes or beats) occur, helping identify tempo shifts and rhythmic transitions.

The computation process follows two main steps:

  1. Onset Detection: First, we identify when musical events happen by computing an onset strength envelope that highlights note attacks and rhythmic events.

  2. Pattern Analysis: Then, we analyze these onset patterns using localized autocorrelation to find repeating rhythmic cycles at different time scales.

Mathematically, the tempogram is computed as:

T(n, \tau) = \sum_{k=0}^{w-1} O(n+k) \cdot O(n+k+\tau)

where T(n, \tau) is the tempogram value at time frame n and lag \tau, O(n) is the onset strength at frame n, and w is the analysis window length.

The resulting tempogram has shape (win_length, n_time_frames), where each row corresponds to a different tempo and high values indicate strong rhythmic activity at that tempo. This makes it excellent for detecting tempo changes, applause sections, and rhythmic transitions in opera recordings.
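
As an illustration, the autocorrelation tempogram defined above can be computed from librosa's onset envelope (a simplified sketch, not the project's TempogramBuilder):

import numpy as np
import librosa

def autocorrelation_tempogram(y: np.ndarray, sr: int, hop: int = 2205, win: int = 100) -> np.ndarray:
    """T(n, tau): windowed autocorrelation of the onset strength envelope O(n)."""
    O = librosa.onset.onset_strength(y=y, sr=sr, hop_length=hop)
    O = np.pad(O, (0, win))                    # pad so the last windows fit
    T = np.zeros((win, len(O) - win))          # shape: (lags, time frames)
    for n in range(T.shape[1]):
        frame = O[n : n + win]
        for tau in range(win):
            T[tau, n] = np.dot(frame[: win - tau], frame[tau:])  # sum_k O(n+k) O(n+k+tau)
    return T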

Implementation

from src.audio_features.builders import TempogramBuilder, SSMBuilder
from src.audio_features.features import Tempogram
import numpy as np

builder = TempogramBuilder(frame_length=4410, hop_length=2205)
tempogram = builder.build(signal=signal)
tempogram = tempogram.normalize(norm='2', threshold=0.001)
tempogram = tempogram.smooth(filter_length=11, window_type='boxcar')
tempogram = tempogram.downsample(factor=5)
tempogram = tempogram.ensure_positive() # Important for log compression
tempogram = tempogram.log_compress(gamma=20)

ssm = SSMBuilder(
    smoothing_filter_length=1,
    smoothing_filter_direction='both',
    shift_set=np.array([0]),
    tempo_relative_set=np.array([1]),
).build(tempogram)

ssm = ssm.threshold(thresh=0.8)
ssm.plot(x_axis_type='time', original_base_feature=tempogram, time_annotations=content_parts)
Applied offset of 1.0000000024632148e-08 to ensure positive values

(top) Tempogram computed from the raw signal showing the rhythmic patterns, (bottom) self-similarity matrix derived from this tempogram.
from src.audio_features.features import NoveltyCurve

nc_tempogram: NoveltyCurve = ssm.compute_novelty_curve(kernel_size=20, variance=2, exclude_borders=True)

peaks = nc_tempogram.find_peaks(threshold=0.4, distance_seconds=10)
nc_tempogram.plot(x_axis_type='time', novelty_name='Tempogram Novelty Curve', peaks=peaks, time_annotations=content_parts, figsize=(8, 4))

nc_tempogram_smooth = nc_tempogram.smooth(sigma=10)
nc_tempogram_smooth.plot(x_axis_type='time', novelty_name='Smoothed Tempogram Novelty Curve', peaks=peaks, time_annotations=content_parts, figsize=(8, 4))

Novelty curve computed from the tempogram self-similarity matrix. Peaks indicate high novelty points in terms of rhythmic features, possibly corresponding to structural boundaries. The curve has been normalized using min-max normalization, which gives it a loosely probabilistic interpretation for transitions.

Here, the highest peak is located at the 1st transition (~22s), which corresponds to the most abrupt tempo change (normal speed to double speed). The second transition (~41s) also shows a peak, but a less pronounced one; a possible reason is that the applause adds noisy rhythmic content that can confuse the tempogram. The last transition (~1:06) is also detected, since the tempo also changes between the two songs.

Note: In this example, the tempogram seems to be the most effective feature for detecting tempo changes, but in real opera recordings its performance tends to be less reliable than that of the other two features (chromagram and MFCC) due to the complex and varying rhythmic structures present in operatic music (see the Results section for more details).

Novelty Curve: Combination of Features

Principle

The combination approach leverages the complementary strengths of different audio features by intelligently merging their individual novelty curves into a unified segmentation result. Rather than relying on a single feature type, this method recognizes that different musical aspects—harmonic changes (chromagram), timbral variations (MFCC), and rhythmic shifts (tempogram)—provide different but valuable information for detecting structural boundaries.

The combination process works in two main steps:

  1. Weighted Feature Integration: Each novelty curve is assigned a weight (0.0 to 1.0) that reflects its importance for the specific analysis.

  2. Mathematical Combination: The weighted curves are combined using one of two methods:

    • Mean combination: Computes the weighted average, requiring consensus across features
    • Max combination: Takes the maximum value, emphasizing the strongest transitions from any feature

Mathematically, for mean combination: NC_{combined}(t) = \frac{\sum_{i} w_i \cdot NC_i(t)}{\sum_{i} w_i}

And for max combination: NC_{combined}(t) = \max_i(w_i \cdot NC_i(t))

where w_i are the feature weights and NC_i(t) are the individual novelty curves.
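
Both fusion rules reduce to a few lines over the stacked curves (a minimal sketch of the formulas above, not the project's NoveltyCurve.combine):

import numpy as np

def combine_novelty(curves: list[np.ndarray], weights: list[float], method: str = "mean") -> np.ndarray:
    """Weighted mean or max fusion of same-length novelty curves."""
    C = np.vstack(curves)                     # shape: (n_features, n_frames)
    w = np.asarray(weights)[:, None]
    if method == "mean":
        return (w * C).sum(axis=0) / w.sum()  # consensus across features
    return (w * C).max(axis=0)                # strongest single-feature response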

Implementation

from src.audio_features.features import NoveltyCurve
import numpy as np

# Define combination parameters (equal weights in this example)
chromagram_weight = 1.0
mfcc_weight = 1.0
tempogram_weight = 1.0
combination_methods = ["mean", "max"]

# Collect the available novelty curves (normal and smoothed) and their weights

curves = [
  ("normal", [nc_chromagram, nc_mfcc, nc_tempogram]),
  ("smoothed", [nc_chromagram_smooth, nc_mfcc_smooth, nc_tempogram_smooth]),
]
corresponding_weights = [chromagram_weight, mfcc_weight, tempogram_weight]

for method in combination_methods:
  for curve_name, available_curves in curves:
    # Combine novelty curves
    nc_combined = NoveltyCurve.combine(
        available_curves, 
        weights=corresponding_weights, 
        method=method
    )

    # Find peaks in combined curve
    peaks = nc_combined.find_peaks(threshold=0.5, distance_seconds=15)
    nc_combined.plot(
        x_axis_type='time', 
        novelty_name=f'Combined {curve_name.upper()} Novelty Curves using {method.upper()} method', 
        peaks=peaks, 
        time_annotations=content_parts, 
        figsize=(8, 4)
    )

The combined approach demonstrates superior performance by detecting transitions that individual features might miss while reducing false positives through the consensus mechanism. The weighted combination allows emphasizing the most reliable feature types for the specific musical content being analyzed.

We see that the mean approach captures all three transitions effectively while smoothing out false positives (e.g., the right-hand peak detected by the chromagram inside the yellow band was effectively smoothed away).

The max approach also captures all three transitions but is more sensitive to false positives since it only requires one feature to signal a transition. This makes this method more aggressive, which can be beneficial in some contexts but may also lead to over-segmentation.

Note: Optuna optimization shows that the combination approach significantly outperforms individual features (0.80 recall vs. 0.58–0.69) while maintaining a precision of 0.51.

▶ Application Usage

The following demonstration video shows how to use the application to explore the novelty curve segmentation methods presented in this section. After a module outputs a curve with its detected transitions, you can either add all of that module's transitions by right-clicking on the plot and pressing "Add all transitions", or add a specific one by right-clicking on it and pressing "Add this transition". All added transitions are then displayed on the main timeline at the bottom of the application window, which is the final segmentation output that can be exported.

In the video, from top left to bottom right, the following modules are shown: Chromagram novelty curve, MFCC novelty curve, Tempogram novelty curve, and Combined novelty curve using the mean method. As can be observed, as the different features' novelty curves are computed, the combined novelty curve module updates its output accordingly. At the end, we see that the Chromagram and MFCC curves reduced the spikes of the Tempogram, keeping only the tempogram's most confident transitions (which turned out to be true positives).

Also, we see that both Chromagram and MFCC novelty curves share a high peak on the right side, outputting, by consensus, a very confident transition in the combined novelty curve (which is a real transition in the example).

2 – Segment Alignment

In contrast to audio-only segmentation, this approach assumes the availability of external opera segments.

Chromagram-based Alignment

Each segment and the full opera recording are represented using chromagrams.
Alignment is performed by maximizing similarity over time shifts.

Implementation

First, we load the main audio track. In this example, we only take the first 500 seconds of Giulio Cesare by Handel (BAR103, track 2, channel 2).

from src.audio.audio_file import AudioFile
from src.audio.signal import Signal
from src.io.ts_annotation import TSAnnotations
import numpy as np

ehrenreich_audio_filepath = "data/alignment/bar103-t2-c2-0-900.wav"
ehrenreich_ground_truths_filepath = "data/alignment/bar103-t2-c2-timestamps.txt"

# Load Ehrenreich audio (500 first seconds) and ground truths
ehrenreich_signal: Signal = AudioFile(ehrenreich_audio_filepath).load().subsignal(0, 500)
ehrenreich_ground_truths = TSAnnotations.load_transitions_txt(ehrenreich_ground_truths_filepath)
* Loaded Audio Path 'data/alignment/bar103-t2-c2-0-900.wav' 
* Samples Number: 43200000 
* Sample Rate: 48000 Hz 
* Duration: 00:15:00 
* File Size: 164.79 MB

After that, we load the first three previews of this same opera from the Naxos database, namely Overture, Act I Scene 1: Caesar! Caesar! Egypt acclaims thee (Chorus), and Act I Scene 1: Kneel in tribute, fair land of Egypt (Caesar). From there, knowing the duration of the whole piece (Ehrenreich) and of each segment, we can perform a temporal interpolation that estimates where each preview starts within the full recording. This gives a good first approximation that is later refined by the chromagram alignment algorithm.

import os
from src.audio_features.aligners import ChromagramAligner

# Load Naxos preview signals from directory
naxos_previews_dir = "data/alignment/previews"
naxos_signals = []

for filename in os.listdir(naxos_previews_dir):
    if filename.endswith(".wav"):
        filepath = os.path.join(naxos_previews_dir, filename)
        naxos_signal: Signal = AudioFile(filepath).load()
        naxos_signals.append(naxos_signal)

print(f"Loaded {len(naxos_signals)} Naxos preview signals.\n")

# Load duration data for temporal interpolation
naxos_durations_filepath = "data/alignment/audio_full_durations.npy"
naxos_preview_durations = np.load(naxos_durations_filepath, allow_pickle=True)

# Calculate cumulative start times (original Naxos timeline)
naxos_cumulative_starts = np.cumsum(naxos_preview_durations)
naxos_total_duration = naxos_cumulative_starts[-1]  # Total duration of all previews
naxos_preview_starts = np.insert(naxos_cumulative_starts, 0, 0)[:-1]  # Insert 0 at start, remove last

print("Original Naxos preview timeline:")
for i, (start, duration) in enumerate(zip(naxos_preview_starts, naxos_preview_durations)):
    print(f"     - Preview {i+1}: starts at {start:.1f}s (duration: {duration:.1f}s)")

# Temporal interpolation: map Naxos timeline to Ehrenreich signal duration
interpolated_starts = [
    ChromagramAligner.get_relative_time(
        time_sec=start, 
        total_duration=naxos_total_duration, 
        reference_duration=ehrenreich_signal.duration_seconds()
    )
    for start in naxos_preview_starts
]

print(f"\nInterpolated timeline (mapped to {ehrenreich_signal.duration_seconds():.1f}s Ehrenreich signal):")
for i, start in enumerate(interpolated_starts):
    print(f"     - Preview {i+1}: estimated start at {start:.1f}s")
* Loaded Audio Path 'data/alignment/previews\30s_preview_01.wav' 
* Samples Number: 1323008 
* Sample Rate: 44100 Hz 
* Duration: 00:00:30 
* File Size: 2.52 MB
* Loaded Audio Path 'data/alignment/previews\30s_preview_02.wav' 
* Samples Number: 1323008 
* Sample Rate: 44100 Hz 
* Duration: 00:00:30 
* File Size: 2.52 MB
* Loaded Audio Path 'data/alignment/previews\30s_preview_03.wav' 
* Samples Number: 1323008 
* Sample Rate: 44100 Hz 
* Duration: 00:00:30 
* File Size: 2.52 MB
Loaded 3 Naxos preview signals.

Original Naxos preview timeline:
     - Preview 1: starts at 0.0s (duration: 182.0s)
     - Preview 2: starts at 182.0s (duration: 98.0s)
     - Preview 3: starts at 280.0s (duration: 125.0s)

Interpolated timeline (mapped to 500.0s Ehrenreich signal):
     - Preview 1: estimated start at 0.0s
     - Preview 2: estimated start at 224.7s
     - Preview 3: estimated start at 345.7s

After loading all data, we convert the raw audio into chromagrams using the same processing parameters.

from src.audio_features.builders import ChromagramBuilder
from src.audio_features.features import Chromagram

def preprocess(chroma: Chromagram) -> Chromagram:
    ch_p = chroma.normalize("2")
    ch_p = ch_p.smooth(21)
    ch_p = ch_p.log_compress(500)
    return ch_p


ehrenreich_chroma: Chromagram = ChromagramBuilder().build(ehrenreich_signal)
ehrenreich_chroma: Chromagram = preprocess(ehrenreich_chroma)
ehrenreich_chroma.plot(figsize=(9, 2), title_override="Ehrenreich Chromagram")

naxos_chromas = []
for i, naxos_signal in enumerate(naxos_signals):
    naxos_chroma: Chromagram = ChromagramBuilder().build(naxos_signal)
    naxos_chroma: Chromagram = preprocess(naxos_chroma)
    naxos_chromas.append(naxos_chroma)
    naxos_chroma.plot(figsize=(9, 2), title_override=f"Naxos Preview Chromagram {i+1}")

Finally, we can start the chromagram alignment. Each preview chromagram is aligned with the reference (the Ehrenreich chromagram) within a restricted search window. These windows are positioned using the temporal interpolation described above. This avoids trying to align the opening of an opera near its end, and it drastically reduces the algorithm's time complexity: for example, if an opera lasts three hours and the search window is 15 minutes, the time complexity is divided by 12.
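
Conceptually, the windowed search amounts to sliding the preview chromagram over a restricted portion of the reference and keeping the position with the highest frame-wise cosine similarity. Here is a simplified fixed-tempo sketch; the project's ChromagramAligner additionally supports non-uniform step sizes (its sigma parameter), which this sketch omits:

import numpy as np

def align_in_window(ref: np.ndarray, query: np.ndarray, expected_start: int, window: int) -> int:
    """Return the reference frame index where the query aligns best,
    searching only +/- window/2 frames around the expected start."""
    n_ref, n_q = ref.shape[1], query.shape[1]
    lo = max(0, expected_start - window // 2)
    hi = min(n_ref - n_q, expected_start + window // 2)
    best_start, best_score = lo, -np.inf
    for s in range(lo, hi + 1):
        seg = ref[:, s : s + n_q]
        cos = np.sum(seg * query, axis=0) / (
            np.linalg.norm(seg, axis=0) * np.linalg.norm(query, axis=0) + 1e-12
        )
        score = cos.mean()                    # mean frame-wise cosine similarity
        if score > best_score:
            best_start, best_score = s, score
    return best_start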

import numpy as np

# Initialize the chromagram aligner
aligner = ChromagramAligner(sigma=np.array([[2, 1], [1, 2], [1, 1]]))

transition_predictions = []

for i, naxos_chroma in enumerate(naxos_chromas):

    # Use the aligner to find the best alignment
    # The aligner expects (reference, query), so we pass ehrenreich_chroma as ref and naxos_chroma as query
    start_s, end_s = aligner.align(
        ref=ehrenreich_chroma,
        query=naxos_chroma,
        expected_start_sec=interpolated_starts[i],
        window_size_sec=1200,
        use_gaussian_filter=True,
        filter_sigma=3,
        output_type="time",
        plot_cost_matrix=True,
    )

    # For transition detection, we use the start of the alignment
    transition_predictions.append(start_s)

print("All detected transitions (seconds):")
for i, transition in enumerate(transition_predictions):
    print(f"    * Preview {i+1} starts at {transition:.2f}s on Ehrenreich's track")

All detected transitions (seconds):
    * Preview 1 starts at 4.27s on Ehrenreich's track
    * Preview 2 starts at 188.02s on Ehrenreich's track
    * Preview 3 starts at 278.52s on Ehrenreich's track

Finally, we can compare the results found with the ground truths annotated from a score.

from src.metrics.metrics import TS_Evaluator

print(
    "Ehrenreich segment number:",
    len(ehrenreich_ground_truths),
    "| Preview segment number:",
    len(naxos_signals),
)

evaluator = TS_Evaluator(tolerance_seconds=15)
evaluation = evaluator.evaluate(ehrenreich_ground_truths, transition_predictions)
evaluator.plot_evaluation(ehrenreich_ground_truths, transition_predictions, figsize=(8, 4))
Ehrenreich segment number: 4 | Preview segment number: 3

The chromagram-based alignment demonstrates very solid performance in this example, achieving perfect precision (100%) by accurately aligning all detected Naxos preview segments with their corresponding positions in the Ehrenreich recording.

However, a fundamental challenge emerges from the inherent disagreement between different sources regarding segment boundaries. In this roughly 8-minute example, the Ehrenreich recording contains 4 structural movements, while the Naxos database provides only 3 corresponding preview segments for the same temporal range. This discrepancy, discussed in the Alignment Segmentation Results section, represents an unavoidable limitation when multiple sources define different segmentation schemes for the same musical work—a common occurrence in opera, where no absolute structural truth exists. Consequently, one genuine transition remains undetected, resulting in a recall of 75% and an overall F1 score of 2 · 1.00 · 0.75 / (1.00 + 0.75) = 0.857.

Despite this inherent limitation, the alignment technique proves highly effective when both recordings maintain sufficient quality. However, practical constraints limit its broader applicability to the Ehrenreich Collection. Not all operas are represented in the Naxos database, and the bootleg nature of Ehrenreich recordings—captured illegally during live performances as described in the Introduction—often results in poor audio quality that can complicate reliable feature alignment.

The selection of chromagram features for alignment was motivated by extensive comparative analysis. Unlike spectrograms, which contain high-dimensional redundant information that increases computational complexity, chromagrams provide a compact 12-dimensional representation focusing on harmonic content. Compared to MFCCs, chromagrams demonstrate superior robustness across different opera versions and recording conditions, as MFCCs prove more sensitive to timbral variations introduced by recording quality differences, background noise, and acoustic environments—factors particularly relevant when comparing commercial studio recordings with bootleg live captures.

▶ Application Usage

The following demonstration video shows how to use the application to perform chromagram-based alignment. First, search for an opera on naxos.com in the catalogue section (https://www.naxos.com/Catalogue). From there, navigate to the opera page (https://www.naxos.com/CatalogueDetail/?id=…) and ensure the page contains playable audio previews; if not, the module will not be able to retrieve anything.

Then, paste this naxos catalogue URL (https://www.naxos.com/CatalogueDetail/?id=…) in the text input of the alignment module. This will automatically fetch all preview audio files from that opera. Once done, a table with playable previews should appear in the application with a blue button on the right side of the module named “Start Alignment”. By clicking on it, the alignment process will start and the table will be updated with the detected start times and end times for each preview within the main audio track.

By clicking on a row of the table, the red preview line will be teleported to the corresponding position in the main timeline at the bottom of the application window, allowing for quick audio comparison between the aligned preview and the main audio track.

To add one or multiple detected transitions to the final timeline, right-click on a row of the table and select “Add this transition” or “Add all transitions” respectively.

Parameters Overview

All algorithms presented use specific parameters that control their behavior and performance. This section provides a comprehensive reference for understanding and tuning these parameters.


Common Parameters

These parameters are shared across multiple segmentation methods:

| Parameter | Options | Description/Impact |
| --- | --- | --- |
| Frame Length | 4410 – 88200 samples | Analysis window size for feature extraction, controlling the temporal vs. frequency resolution trade-off. Larger frames (~2 seconds): better frequency resolution, smoother curves, ideal for harmonic analysis but reduced temporal precision. Smaller frames (~0.1 seconds): better temporal precision for rapid changes, noisier curves, reduced frequency resolution. |
| Hop Length | 2205 – 44100 samples | Step size between successive analysis windows, determining temporal granularity. Must be ≤ frame length. Smaller values: finer temporal resolution, higher computational cost. Larger values: smoother curves, reduced computational load, may miss brief changes. |
| Threshold | 0.0 – 1.0 | Peak detection sensitivity for potential transitions. Higher values (≥ 0.7): only prominent peaks, fewer false positives, may miss subtle boundaries. Lower values (≤ 0.4): more sensitive detection, higher recall, more false positives. |
| Min Distance Between Peaks | 1.0 – 60.0 seconds | Minimum temporal separation between detected transitions; prevents multiple detections of the same boundary. Higher values: distinct structural changes, may miss close sections. Lower values: detection of rapid changes, risk of over-segmentation. |

Silence Curve Parameters

| Parameter | Options | Description/Impact |
| --- | --- | --- |
| Silence Type | amplitude, spectral | Energy computation method. amplitude: RMS energy in the time domain, efficient for clear silences such as pauses between movements. spectral: STFT-based energy, more sensitive to subtle frequency changes; can detect quiet sustained notes or orchestral texture changes. |
| Min Silence Duration | 0.0 – 5.0 seconds | Minimum duration for valid silence regions. Higher values: reduce false positives by filtering brief pauses (breathing), may miss shorter boundaries. Lower values: more sensitive, but risk over-segmentation from small pauses. |

Applause Curve (HRPS) Parameters

The HRPS method targets applause detection through residual component analysis:

| Parameter | Options | Description/Impact |
| --- | --- | --- |
| L_h_frames | 1 – 100 frames | Harmonic component temporal smoothing window; controls the coherence of harmonic elements. Larger values (≥ 50): better isolation of sustained elements (notes/chords), may blur rapid harmonic changes. Smaller values (≤ 20): preserve temporal detail, may introduce noise affecting separation quality. |
| L_p_bins | 1 – 200 bins | Percussive component frequency smoothing filter; determines percussive isolation across the spectrum. Larger values (≥ 100): capture broadband events like applause spanning multiple frequencies, may merge distinct sources. Smaller values (≤ 50): preserve frequency resolution for isolated elements, may fragment broadband events. |
| Beta (β) | 1.1 – 5.0 | Separation strictness factor controlling harmonic/percussive/residual classification. β = 1.0 gives a standard harmonic-percussive decomposition. Higher β: more selective classification, more content in the residual component. Values around 2.0 are optimal for applause (mixed characteristics); values > 2.5 are too aggressive. |

Novelty Curve Parameters

Novelty-based methods (Chromagram, MFCC, Tempogram) share a complex processing pipeline:

Feature Processing Parameters

| Parameter | Options | Description/Impact |
| --- | --- | --- |
| Normalization | 1, 2, max, z | Feature normalization method. 1: L1 norm, preserves proportions. 2: L2 norm (best for chromagrams), creates unit vectors. max: scales by peak value, preserves dynamics (good for MFCCs). z: zero-mean, unit-variance standardization. |
| Smoothing Filter Length | 1 – 51 frames (odd) | Temporal smoothing for noise reduction and pattern enhancement. Longer filters (≥ 21): smoother trajectories, reduced sensitivity to rapid fluctuations, may miss brief transitions. Shorter filters (≤ 11): preserve temporal detail, may introduce noise into the SSM computation. |
| Downsampling Factor | 1 – 50 | Complexity reduction while preserving structural information. Higher factors (≥ 20): significant computation/memory reduction for long recordings, may lose fine temporal details. Lower factors (≤ 10): preserve temporal resolution, higher computational cost, ideal for precise transitions. |
| Log Compression | 0.0 – 50.0 | Gamma value for dynamic range compression; enhances weak components. Higher values (≥ 10): stronger compression, emphasizes subtle variations indicating boundaries. Lower values (≤ 5): preserve original dynamics, maintain contrasts but may miss subtle transitions. |

Self-Similarity Matrix (SSM) Parameters

| Parameter | Options | Description/Impact |
| --- | --- | --- |
| SSM Smoothing Length | 1 – 51 frames (odd) | Smoothing filter for SSM noise reduction and diagonal structure enhancement. Longer filters: create clearer block structures for musical segments, may merge close boundaries. Shorter filters: preserve fine details, may introduce noise into the novelty curve. |
| SSM Smoothing Direction | forward, backward, both | Temporal smoothing direction. both: bidirectional (most common), balanced results with enhanced structural patterns. forward/backward: unidirectional, may create asymmetric effects for specific structures. |
| SSM Threshold | 0.0 – 1.0 | Similarity threshold for structural contrast enhancement. Higher thresholds (≥ 0.7): sparser SSMs, only strong similarities, clearer boundaries but may miss subtle transitions. Lower thresholds (≤ 0.3): more similarity information, may introduce noise into the novelty curve. |
| Binarize SSM | boolean | Convert the thresholded SSM to binary (0/1) instead of continuous values, creating sharper structural boundaries. Benefits chromagrams (sharp structural delineation); continuous values work better for MFCC/tempogram (preserved gradient information for optimal detection). |

Novelty Computation Parameters

| Parameter | Options | Description/Impact |
| --- | --- | --- |
| Kernel Size | 1 – 150 frames | Gaussian kernel size for novelty detection; controls the temporal window for structural changes. Larger kernels (≥ 50): detect broader changes (major movements/sections), may miss local transitions. Smaller kernels (≤ 20): sensitive to rapid changes, may create noisy curves with false positives. |
| Variance | 0.1 – 100.0 | Gaussian kernel spread parameter; controls detection window characteristics. Higher variance (≥ 10): broader detection windows, emphasize gradual changes, reduce sensitivity to abrupt transitions. Lower variance (≤ 5): sharper detection windows, better for precise moments, more susceptible to noise. |
| Smoothing Sigma | 0.0 – 40.0 | Final Gaussian smoothing of the novelty curve to reduce noise and enhance peaks. Higher sigma (≥ 15): very smooth curves with clear broad peaks for major boundaries, may merge close transitions. Lower sigma (≤ 5): preserve fine temporal details and rapid transitions, may retain noise affecting peak detection. |

Combination Parameters

Controls how multiple novelty curves are merged:

| Parameter | Options | Description/Impact |
| --- | --- | --- |
| Chromagram Weight | 0.0 – 1.0 | Importance of harmonic transitions (key changes, chord progressions, tonal shifts). Optuna optimization consistently favors higher weights, suggesting harmonic information provides reliable opera boundary indicators. |
| MFCC Weight | 0.0 – 1.0 | Importance of timbral changes (instrumentation, vocal texture, recording characteristics). Also consistently receives higher weights in optimization. |
| Tempogram Weight | 0.0 – 1.0 | Importance of rhythmic shifts (tempo changes, meter variations, applause transitions). Receives lower weights, as opera transitions often maintain rhythmic continuity despite harmonic/timbral changes. |
| Combination Method | mean, max | Mathematical fusion approach. mean: weighted average requiring consensus across features, balanced results, reduces false positives, may miss single-feature transitions. max: maximum value emphasis, more sensitive, higher recall for diverse transitions, potentially more false positives. Formulas: mean: NC_{combined}(t) = \frac{\sum_{i} w_i \cdot NC_i(t)}{\sum_{i} w_i}; max: NC_{combined}(t) = \max_i(w_i \cdot NC_i(t)) |

▶ Application Usage

All modules in the application allow these parameters to be modified, recomputing the output curves on-the-fly. You can access them by clicking the gear icon in the bottom right corner of each module. The parameters can be reset to their default values by clicking the Reset To Defaults button.

Methods Evaluation and Optimization

Metrics Used

To assess the efficacy of the presented methods, a proper evaluation is crucial. The first step is selecting a good metric that faithfully indicates whether a segmentation is good or not. For that, the F1 score turns out to be a perfect match. Its formula is the following:

F_1=2\cdot\frac{Precision \cdot Recall}{Precision + Recall} = \frac{2 \cdot TP}{2 \cdot TP + FP + FN}

with the Precision and Recall being:

Precision = \frac{TP}{TP + FP} \qquad Recall = \frac{TP}{TP + FN}

In our case, precision tells how reliable an algorithm is when it finds a transition. If the precision is 75%, it means that 75% of the transitions reported by the algorithm are correct.

Recall measures the coverage of the algorithm. If the recall is 75%, it means that 75% of all real transitions in the audio have been found by the algorithm.

Taken separately, neither metric is a good indicator. Maximizing only the precision can lead to a program that effectively refuses to find transitions for fear of being wrong. On the other side, maximizing only the recall encourages the algorithm to find transitions everywhere to avoid missing one. The F1 score combines both metrics and ensures that the program performs well on both.

Additionally, more weight can be put on one metric to favor a certain behaviour. For instance, in our case it could be relevant to weight precision more heavily so that the algorithm is more reliable, at the cost of missing a few more transitions.

To favor one metric (e.g., recall), a weighted F1 score can be used (a small helper is sketched after the list below):

F_\beta = (1 + \beta^2) \cdot \frac{Precision \cdot Recall}{\beta^2 \cdot Precision + Recall}

where \beta controls the relative weight of the two metrics:

  • \beta = 1.0: both metrics weigh the same (standard F1 score)
  • \beta < 1.0: precision is prioritized over recall
  • \beta > 1.0: recall is prioritized over precision
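
A minimal helper implementing this weighted score (illustrative, separate from the project's TS_Evaluator):

def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted F-score: beta > 1 favors recall, beta < 1 favors precision."""
    if precision + recall == 0:
        return 0.0
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)

print(f_beta(0.70, 0.55))            # standard F1
print(f_beta(0.70, 0.55, beta=0.5))  # precision-weighted variant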

Optimization using Optuna

Now that a suitable metric has been chosen, some of the previous methods can be optimized by maximizing it. Indeed, several techniques have multiple parameters, and finding the optimal ones is not a trivial task. The proposed solution for tuning them intelligently is Optuna (Akiba et al. 2019), a hyperparameter optimization framework that makes it easy to define an optimization problem and search for the best parameters using various algorithms. In our case, the Tree-structured Parzen Estimator (TPE) algorithm was used to optimize the parameters of the novelty-based segmentation methods. The optimization runs multiple trials, each corresponding to a different set of parameters; after a predefined number of trials, the parameters achieving the highest F1 score are selected.
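
A typical study looks as follows (a sketch: the search space and the run_segmentation helper are hypothetical stand-ins for the project's pipeline, while the Optuna calls are the library's actual API):

import optuna

def run_segmentation(kernel_size: int, threshold: float, smoothing_sigma: float) -> float:
    # Hypothetical placeholder: build the novelty curve with these parameters,
    # pick peaks, compare against annotated transitions, and return the F1 score.
    ...

def objective(trial: optuna.Trial) -> float:
    return run_segmentation(
        kernel_size=trial.suggest_int("kernel_size", 10, 150),
        threshold=trial.suggest_float("threshold", 0.1, 0.9),
        smoothing_sigma=trial.suggest_float("smoothing_sigma", 0.0, 40.0),
    )

study = optuna.create_study(direction="maximize", sampler=optuna.samplers.TPESampler())
study.optimize(objective, n_trials=100)
print(study.best_params, study.best_value)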

Example plot evaluating the predictions of the chromagram novelty curve segmentation approach.

Results

Audio-Only Segmentation Results

The table below summarizes the precision, recall, and F1 scores obtained by each audio-only segmentation method after optimization. The dataset used for this evaluation consists of three opera excerpts from the Ehrenreich collection, each with structural boundaries annotated from scores. The best score in each column is shown in bold and the second-best in italics.

| Method | Precision¹ | Recall¹ | F1 Score¹ |
| --- | --- | --- | --- |
| Silence Curve | 0.32 | **0.81** | 0.45 |
| Applause Curve (HRPS)² | **1.00** | 0.34 | 0.51 |
| Novelty Curve (Chromagram) | 0.51 | 0.65 | 0.54 |
| Novelty Curve (MFCC) | *0.53* | 0.69 | *0.56* |
| Novelty Curve (Tempogram) | 0.44 | 0.58 | 0.46 |
| Novelty Curve (Combined) | 0.51 | *0.80* | **0.60** |

¹ These metrics represent mean values across all evaluation windows, not overall metrics computed from cumulative TP, FP, FN counts. Each value can be interpreted as: “on an average window (15 minutes), what would be my Precision, Recall, or F1 Score”. This is why the F1 Score values are not exactly equal to: \frac{2\cdot\text{Precision}\cdot\text{Recall}}{\text{Precision}+\text{Recall}} due to the averaging process across windows.

The reported F1 score is therefore: \frac{1}{W} \sum_{w=1}^{W} \frac{2 \cdot \text{Precision}_w \cdot \text{Recall}_w}{\text{Precision}_w + \text{Recall}_w}

² HRPS applause curve results are not directly comparable to other methods since only one of the three evaluation excerpts contains applause segments. While the F1 score appears reasonable, applying this algorithm to excerpts without applause would be meaningless. However, these results demonstrate that HRPS-based segmentation can be highly effective when properly calibrated for recordings containing applause.

Alignment Segmentation Results

Regarding the segment alignment method, and similar to the HRPS applause curve, the results are not directly comparable to the audio-only segmentation methods, since not all excerpts were accessible or relevant in the naxos.com database. Therefore, the following table shows results for a single excerpt, namely Giulio Cesare by Handel (BAR103, track 2, channel 2).

The alignment results are compared against a simple baseline method that estimates segment start times using only the duration ratio between preview segments and the full opera recording, without any chromagram-based refinement.

Results are presented both with and without a post-processing correction step (+ Correction). This correction step addresses a data mismatch issue: the Naxos database contained fewer preview segments than were annotated in the musical score. For example, Naxos might list 60 opera segments while the score annotations marked 70 structural boundaries. This discrepancy led to degraded recall during evaluation, as the algorithm could only detect segments that existed in the Naxos data, missing valid structural boundaries that were annotated in the score but had no corresponding Naxos preview. The correction step accounts for these missing segments to provide a fairer comparison against the available score-based ground truth.

| Method | Precision | Recall | F1 Score |
| --- | --- | --- | --- |
| Baseline Alignment | 0.38 | 0.32 | 0.35 |
| + Correction | 0.38 | 0.38 | 0.38 |
| Feature-Based Alignment | 0.70 | 0.55 | 0.62 |
| + Correction | 0.70 | 0.65 | 0.67 |

Discussion

Performance Analysis and Method Comparison

The evaluation results reveal distinct performance characteristics across different segmentation approaches, each with specific strengths and limitations that have important implications for practical opera analysis.

Audio-Only Segmentation Methods

The combined novelty curve approach emerges as the most effective audio-only method, achieving the highest F1 score (0.60) by successfully balancing precision and recall. This superior performance validates the core hypothesis that different audio features capture complementary aspects of musical structure—harmonic transitions (chromagram), timbral changes (MFCC), and rhythmic variations (tempogram). The Optuna optimization results further support this, consistently favoring chromagram and MFCC features over tempogram features, suggesting that harmonic and timbral information provide more reliable indicators of opera movement boundaries than rhythmic patterns.

Interestingly, the silence curve method demonstrates the highest recall (0.81) among all approaches, indicating exceptional capability in detecting actual structural boundaries. However, its low precision (0.32) reveals a tendency toward over-segmentation, detecting many non-structural silences. This behavior is particularly relevant for opera recordings, where brief pauses for breathing, dramatic effect, or applause transitions may not correspond to true movement boundaries. For musicologists prioritizing comprehensive boundary detection over precision, this method could serve as an initial screening tool.

The individual novelty curve methods show moderate but consistent performance across different feature types. MFCC-based segmentation achieves the highest precision (0.53), likely due to its sensitivity to orchestral and vocal texture changes that often accompany major structural transitions. Chromagram-based methods perform well in detecting harmonic transitions, while tempogram-based approaches show the lowest overall performance, possibly because opera transitions often maintain rhythmic continuity despite significant harmonic and timbral changes.

Segment Alignment Approach

The feature-based alignment method demonstrates promising results when compared to naive baseline approaches, improving F1 scores from 0.35 to 0.67 with correction. This substantial improvement validates the chromagram-based Dynamic Time Warping (DTW) approach for aligning preview segments with full recordings. The correction mechanism proves crucial, improving recall from 0.55 to 0.65 by accounting for data mismatch issues between Naxos preview availability and score annotations.

The method achieves a precision of 0.70, reaching the 0.7 threshold for reliable boundary detection. This indicates that identified boundaries have minimal false positives, making the approach suitable for applications requiring confident structural boundary detection.

However, the alignment approach faces inherent scalability limitations. The requirement for external segment data (Naxos previews) restricts its applicability to well-documented operas with available commercial segments. This dependency makes the method less suitable for the broader Ehrenreich Collection, which contains many rare and bootleg recordings lacking corresponding preview segments.

Methodological Limitations and Evaluation Challenges

Several factors limit the generalizability of these results. The evaluation dataset consists of only three opera excerpts, potentially insufficient for capturing the full diversity of structural patterns across different composers, periods, and performance styles. The ground truth annotations, derived from musical scores, may not always reflect the acoustic reality of live performances, particularly in bootleg recordings where interpretive decisions and performance conditions can alter structural boundaries.

The data mismatch issues highlighted in the alignment evaluation underscore a broader challenge in computational musicology: the discrepancy between theoretical musical structure (score annotations) and practical segmentation needs (available preview segments). This challenge is particularly acute for historical recordings where comprehensive metadata may be incomplete or unavailable.

Practical Implications for Opera Analysis

For musicologists and researchers working with the Ehrenreich Collection, these results suggest a hybrid approach combining multiple methods based on specific research needs:

  • Comprehensive boundary detection: Use silence curve methods for initial screening, accepting higher false positive rates to minimize missed boundaries
  • Precise structural analysis: Apply combined novelty curves for balanced precision-recall performance when analyzing specific opera excerpts
  • Comparative studies: Employ segment alignment methods when working with well-documented operas having available preview segments

Conclusion and Future Work

Summary of Key Findings

This project successfully demonstrates that computational segmentation of opera recordings is feasible using established music information retrieval techniques, with several methods showing promising performance for practical musicological applications.

The combined novelty curve approach represents the most significant contribution, achieving an F1 score of 0.60 by intelligently fusing chromagram, MFCC, and tempogram features. This method consistently outperformed individual feature-based approaches, validating the hypothesis that different audio characteristics capture complementary aspects of musical structure. The Optuna-optimized feature weights confirm that harmonic (chromagram) and timbral (MFCC) information provide more reliable structural indicators than rhythmic patterns for opera segmentation.

The interactive PyQt6 application provides researchers with immediate access to these methods, enabling real-time parameter adjustment and comparative analysis. This tool bridges the gap between algorithmic research and practical musicological workflows, allowing domain experts to fine-tune segmentation approaches for specific recordings or research questions.

Future Research Directions

Machine Learning and Self-Supervised Approaches

Current methods rely on hand-crafted features and traditional signal processing approaches, which, while effective, may not capture the full complexity of musical structure. During this project, initial exploration was conducted into self-supervised deep learning approaches that could leverage the vast unannotated audio content of the Ehrenreich Collection for pretraining, followed by fine-tuning on smaller annotated datasets.

The proposed approach centers on contrastive learning between audio segments, where the model learns meaningful representations by comparing spectral features of different audio windows. Unlike raw audio processing, this method would operate on audio features (e.g. spectrograms), which provide a more stable and less noisy representation suitable for structural analysis tasks. The self-supervised pretraining phase would utilize the thousands of hours available in the Ehrenreich Collection, exposing the model to diverse opera styles, composers, and performance conditions to learn general musical structural patterns.

For practical implementation, the approach envisions sliding window classification using 5-second audio windows, where each window receives a continuous probability score (0-1) indicating the likelihood of containing a structural boundary. This granular approach would enable smooth, probabilistic segmentation by iterating over entire recordings and aggregating window-level predictions. The trained model could be directly integrated into the PyQt6 application, providing users with an additional machine learning module alongside existing signal processing methods.

Initial conceptual work focused on binary classification frameworks where the model distinguishes between transition and non-transition windows, but preliminary exploration suggested that continuous probability outputs offer more nuanced and interpretable results for musicological analysis. However, the substantial computational requirements for self-supervised pretraining, combined with the time-intensive process of acquiring diverse annotated opera datasets for fine-tuning, led to this approach being deferred in favor of the implemented traditional methods.

Future work could build upon this foundation by exploring modern self-supervised architectures like TS-TCC (Eldele et al. 2021) or music-specific models like MERT (Yizhi et al. 2023), adapted for the specific challenges of opera segmentation. The integration of such approaches would significantly enhance the application’s capabilities, potentially achieving superior performance through learned representations rather than hand-crafted features.

References

Akiba, Takuya, Shotaro Sano, Toshihiko Yanase, Takeru Ohta, and Masanori Koyama. 2019. “Optuna: A Next-Generation Hyperparameter Optimization Framework.” https://optuna.org/.
Bern University of Applied Sciences. 2024. “The Ehrenreich Collection Online Platform.” https://web.ehrenreich.bfh.science/.
Demule, Hugo. 2025. “Ehrenreich Collection Semester Project — Source Code.” https://gitlab.epfl.ch/rammos/ehrenreich-collection-semester-project.
Driedger, Jonathan, Meinard Müller, and Sascha Disch. 2014. “Extending Harmonic-Percussive Separation of Audio Signals.” In Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), 611–16.
Eldele, Emadeldeen, Mohamed Ragab, Zhenghua Chen, Min Wu, Chee Keong Kwoh, Xiaoli Li, and Cuntai Guan. 2021. “Time-Series Representation Learning via Temporal and Contextual Contrasting.” arXiv Preprint arXiv:2106.14112.
Fitzgerald, Derry. 2010. “Harmonic/Percussive Separation Using Median Filtering.” In Proceedings of the International Conference on Digital Audio Effects (DAFx-10).
Hochschule der Künste Bern. 2021. “Ehrenreich Collection — Identity, Voice, Sound.” https://www.hkb-interpretation.ch/projekte/ehrenreich-collection.
Müller, Meinard. 2015. Fundamentals of Music Processing: Audio, Analysis, Algorithms, Applications. Cham: Springer. https://doi.org/10.1007/978-3-319-21945-5.
Riverbank Computing Limited. 2021. “PyQt6: Python Bindings for the Qt Cross Platform Application Toolkit.” https://pypi.org/project/PyQt6/.
Yizhi, LI, Ruibin Yuan, Ge Zhang, Yinghao Ma, Xingran Chen, Hanzhi Yin, Chenghao Xiao, et al. 2023. “MERT: Acoustic Music Understanding Model with Large-Scale Self-Supervised Training.” In The Twelfth International Conference on Learning Representations.